Lexical Features for Statistical Machine Translation
نویسندگان
چکیده
Title of dissertation: LEXICAL FEATURES FOR STATISTICAL MACHINE TRANSLATION Jacob Devlin, Master of Science, 2009 Dissertation directed by: Professor Bonnie Dorr Department of Computer Science In modern phrasal and hierarchical statistical machine translation systems, two major features model translation: rule translation probabilities and lexical smoothing scores. The rule translation probabilities are computed as maximum likelihood estimates (MLEs) of an entire source (or target) phrase translating to a target (or source) phrase. The lexical smoothing scores are also a likelihood estimate of a source (target) phrase translating to a target (source) phrase, but they are computed using independent word-to-word translation probabilities. Intuitively, it would seem that the lexical smoothing score is a less powerful estimate of translation likelihood due to this independence assumption, but I present the somewhat surprising result that lexical smoothing is far more important to the quality of a state-of-the-art hierarchical SMT system than rule translation probabilities. I posit that this is due to a fundamental data sparsity problem: The average word-to-word translation is seen many more times than the average phrase-to-phrase translation, so the wordto-word translation probabilities (or lexical probabilities) are far better estimated. Motivated by this result, I present a number of novel methods for modifying the lexical probabilities to improve the quality of our MT output. First, I examine two methods of lexical probability biasing, where for each test document, a set of secondary lexical probabilities are extracted and interpolated with the primary lexical probability distribution. Biasing each document with the probabilities extracted from its own first-pass decoding output provides a small but consistent gain of about 0.4 BLEU. Second, I contextualize the lexical probabilities by factoring in additional information such as the previous or next word. The key to the success of this contextdependent lexical smoothing is a backoff model, where our “trust” of a contextdependent probability estimation is directly proportional to how many times it was seen in the training. In this way, I avoid the estimation problem seen in translation rules, where the amount of context is high but the probability estimation is inaccurate. When using the surrounding words as context, this feature provides a gain of about 0.6 BLEU on Arabic and Chinese. Finally, I describe several types of discriminatively trained lexical features, along with a new optimization procedure called Expected-BLEU optimization. This new optimization procedure is able to robustly estimate weights for thousands of decoding features, which can in effect discriminatively optimize a set of lexical probabilities to maximize BLEU. I also describe two other discriminative feature types, one of which is the part-of-speech analogue to lexical probabilities, and the other of which estimates training corpus weights based on lexical translations. The discriminative features produce a gain of 0.8 BLEU on Arabic and 0.4 BLEU on Chinese. LEXICAL FEATURES FOR STATISTICAL MACHINE TRANSLATION
منابع مشابه
A Hybrid Machine Translation System Based on a Monotone Decoder
In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...
متن کاملThe Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language
Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages is still under question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...
متن کاملIntegrating Source-Language Context into Log-Linear Models of Statistical Machine Translation
The translation features typically used in state-of-the-art statistical machine translation (SMT) model dependencies between the source and target phrases, but not among the phrases in the source language themselves. A swathe of research has demonstrated that integrating source context modelling directly into log-linear phrasebased SMT (PB-SMT) and hierarchical PB-SMT (HPB-SMT), and can positiv...
متن کاملSupertags as Source Language Context in Hierarchical Phrase-Based SMT
Statistical machine translation (SMT) models have recently begun to include source context modeling, under the assumption that the proper lexical choice of the translation for an ambiguous word can be determined from the context in which it appears. Various types of lexical and syntactic features have been explored as effective source context to improve phrase selection in SMT. In the present w...
متن کاملDocument-level Re-ranking with Soft Lexical and Semantic Features for Statistical Machine Translation
We introduce two document-level features to polish baseline sentence-level translations generated by a state-of-the-art statistical machine translation (SMT) system. One feature uses the word-embedding technique to model the relation between a sentence and its context on the target side; the other feature is a crisp document-level token-type ratio of target-side translations for source-side wor...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2009